In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stat
from matplotlib import pyplot as plt
After playing around with the previous option (in the notebook GenerateProbs) and thinking a little more about what makes a distribution "normal," I've come up with the following method that I think will work nicely for generating probabilities for a general coupon collector simulation.
My idea starts with the quantile function (called a percent point function (ppf) in SciPy's stats library). The ppf is the inverse of the cumulative distribution function (cdf): given a cumulative probability, it returns the value below which that fraction of the distribution lies. Evaluating a normal ppf at evenly spaced points gives most of the coupons probabilities centered around some number, with a few coupons having lower probabilities and a few coupons having higher probabilities.
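To make sure I have this right, here's a quick sanity check (the loc and scale values are arbitrary): feeding the ppf's output back into the cdf should return the original probabilities.
In [ ]:
#sanity check: the ppf is the inverse of the cdf, so cdf(ppf(q)) should return q
q = np.array([0.1, 0.5, 0.9])
print(stat.norm.cdf(stat.norm.ppf(q, 0.05, 0.01), 0.05, 0.01)) #expect [0.1 0.5 0.9]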
The next piece is that a normal distribution is symmetric around its mean, and the mean of the coupons' probabilities must be $\frac{1}{n}$: there are $n$ coupons whose probabilities must sum to 1, so their mean is $\frac{1}{n}$.
The only parameter that defines a normal distribution other than the mean is the standard deviation (or equivalently the variance), and we can change it to get a different "spread" of probabilities around the mean of $\frac{1}{n}$. The higher the standard deviation, the greater the spread, just like we expect.
While this may not be the absolute best way to generate coupon probabilities, it results in a set of probabilities that meets criteria most people would call "normal." Let's look at an example and see how changing the standard deviation affects the shape of the probabilities.
In [2]:
n = 20 #number of coupons
mu = 1/n #this is the mean coupon probability
sigma = mu/2 #this is the std dev parameter we will play around with - it seems to make sense to express it in terms of the mean
x = np.arange(n)+0.5 #arange goes from 0 to n-1; adding 0.5 gives the midpoints 0.5, 1.5, ..., n-0.5, so x/n stays strictly inside (0, 1)
p_x = stat.norm.ppf(x/n, mu, sigma)
print('mean prob = %1.16f, sum of all probs = %1.16f' % (np.mean(p_x), np.sum(p_x)))
plt.bar(x, p_x, align='center')
plt.axis([0,n,0,2/n])
Out[2]:
In [3]:
n = 20 #number of coupons
mu = 1/n #this is the mean coupon probability
sigma = mu/10 #this is the std dev parameter we will play around with - it seems to make sense to express it in terms of the mean
x = np.arange(n)+0.5 #arange goes from 0 to n-1; adding 0.5 gives the midpoints 0.5, 1.5, ..., n-0.5, so x/n stays strictly inside (0, 1)
p_x = stat.norm.ppf(x/n, mu, sigma)
print('mean prob = %1.16f, sum of all probs = %1.16f' % (np.mean(p_x), np.sum(p_x)))
plt.bar(x, p_x, align='center')
plt.axis([0,n,0,2/n])
Out[3]:
Notice that as we decreased the standard deviation (from half of the mean to a tenth of the mean), our spread of probabilities got a lot smaller. Also notice that this method doesn't require any scaling of the probabilities to get them to sum to 1. This gives me a warm and fuzzy feeling that this is a good way to generate coupon probabilities.
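Here's a minimal sketch of why the sum works out with no rescaling: the quantile points come in symmetric pairs $q$ and $1-q$, and the normal ppf straddles its mean symmetrically, so each pair of probabilities sums to exactly $2\mu$ and the $n$ probabilities sum to $n\mu = 1$.
In [ ]:
#each pair ppf(q), ppf(1-q) straddles the mean symmetrically, so each pair sums to 2*mu
n, mu, sigma = 20, 1/20, (1/20)/2
q = (np.arange(n) + 0.5)/n
pair_sums = stat.norm.ppf(q, mu, sigma) + stat.norm.ppf(1 - q, mu, sigma)
print(np.allclose(pair_sums, 2*mu)) #expect True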
Obviously the minimum value for $\sigma$ is 0; in that limit every coupon has probability exactly $\frac{1}{n}$, which is the coupon collector's problem with a uniform distribution, so it is a good way to check our work. But what is the max value? From the example below, we see that there is clearly a point where the first few coupons have negative probabilities. This should be avoided:
In [4]:
n = 20 #number of coupons
mu = 1/n #this is the mean coupon probability
sigma = mu #this is the std dev parameter we will play around with - it seems to make sense to express it in terms of the mean
x = np.arange(n)+0.5 #arange goes from 0 to n-1; adding 0.5 gives the midpoints 0.5, 1.5, ..., n-0.5, so x/n stays strictly inside (0, 1)
p_x = stat.norm.ppf(x/n, mu, sigma)
print('mean prob = %1.16f, sum of all probs = %1.16f' % (np.mean(p_x), np.sum(p_x)))
plt.bar(x, p_x, align='center')
plt.axis([0,n,-0.1,3/n])
Out[4]:
So clearly there is a max value of $\sigma$ beyond which our probabilities don't make sense. This max value varies based on what our $n$ is. We can calculate it using what we know about the normal quantile function. To start, define the cdf of a normal distribution with mean $\mu$ and standard deviation $\sigma$ as $F(x, \mu, \sigma)$. The normal quantile function is simply the inverse of this: $F^{-1}(p, \mu, \sigma)$. As is common with the normal distribution, there is no closed form for the normal quantile function, but it can be viewed as a transform of the standard normal quantile function (also called the probit function, $\Phi^{-1}$), for which there are good approximations. The transform is $F^{-1}(p, \mu, \sigma) = \mu + \Phi^{-1}(p)\,\sigma$. In generating probabilities for the CCP, the cells above use $F^{-1}\left(\frac{i - 1/2}{n}, \frac{1}{n}, \sigma\right)$ for $i = 1, 2, 3, \ldots, n$. Because the quantile function is increasing, the smallest (and first to go negative) probability comes from the smallest quantile point, $\frac{1}{2n}$, so all we need to do is find the max $\sigma$ that keeps $F^{-1}\left(\frac{1}{2n}, \frac{1}{n}, \sigma\right) > 0$. Piece of cake.
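Before grinding through the algebra, here's a minimal sketch verifying that location-scale identity numerically (the values of $p$, $\mu$, and $\sigma$ here are arbitrary):
In [ ]:
#the normal quantile is a location-scale transform of the probit function
p, mu, sigma = 0.025, 1/20, 0.02
print(stat.norm.ppf(p, mu, sigma)) #direct quantile
print(mu + stat.norm.ppf(p)*sigma) #mu + probit(p)*sigma - should match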
Step 1: set the quantile function greater than zero: $F^{-1}\left(\frac{1}{2n}, \frac{1}{n}, \sigma\right) = \frac{1}{n} + \Phi^{-1}\left(\frac{1}{2n}\right)\sigma > 0$
Step 2: solve for sigma: $\sigma < -\frac{1}{n\,\Phi^{-1}\left(\frac{1}{2n}\right)}$
Step 3: explain why the inequality sign flipped: I divided by $\Phi^{-1}\left(\frac{1}{2n}\right)$ when solving for $\sigma$, and $\Phi^{-1}\left(\frac{1}{2n}\right)$ is always negative here because $\frac{1}{2n} < 0.5$ for every integer $n > 1$, and the probit function is negative for all arguments less than 0.5.
Step 4: rewrite in terms of the error function, using $\Phi^{-1}(p) = \sqrt{2}\,\operatorname{erf}^{-1}(2p - 1)$: $\sigma < -\frac{1}{\sqrt{2}\,n\;\operatorname{erf}^{-1}\left(\frac{1}{n} - 1\right)}$
Step 5: write as bounds for $\sigma$: $0 < \sigma < -\frac{1}{\sqrt{2}\,n\;\operatorname{erf}^{-1}\left(\frac{1}{n} - 1\right)}$
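Here's a minimal sketch that computes this bound for $n = 20$ and checks it: a $\sigma$ just below the bound should give all positive probabilities, and a $\sigma$ just above it should not.
In [ ]:
from scipy.special import erfinv
n = 20
sigma_max = -1/(np.sqrt(2)*n*erfinv(1/n - 1)) #the bound from Step 5
print('sigma_max = %f (= %.4f * mu)' % (sigma_max, sigma_max*n))
q = (np.arange(n) + 0.5)/n
print(np.all(stat.norm.ppf(q, 1/n, 0.99*sigma_max) > 0)) #expect True
print(np.all(stat.norm.ppf(q, 1/n, 1.01*sigma_max) > 0)) #expect False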
As the population $n$ grows very large, our maximum allowable value for $\sigma$ approaches our minimum value (0). Consider the inequality for $\sigma$ in terms of the probit function: $\sigma < -\frac{1}{n\,\Phi^{-1}\left(\frac{1}{2n}\right)}$. As $n \rightarrow \infty$, the probit term diverges ($\Phi^{-1}\left(\frac{1}{2n}\right) \rightarrow -\infty$), so the denominator goes to $-\infty$ and the overall fraction approaches zero from above. I noted earlier that $\sigma = 0$ implies the uniform distribution, and we could use this as an approximation for very large $n$. It makes sense that very large $n$ allows less deviation from the mean: $\frac{1}{n}$ is very small, and since every positive deviation above the mean must be symmetrically offset by a negative deviation below it, the below-mean probabilities have almost no room before they hit zero.
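Here's a minimal sketch of that limit: computing the bound for increasing $n$ shows the allowable $\sigma$ shrinking, both in absolute terms and relative to the mean $\frac{1}{n}$.
In [ ]:
#the maximum allowable sigma shrinks as n grows
for n in [10, 100, 1000, 10000]:
    sigma_max = -1/(n*stat.norm.ppf(1/(2*n))) #probit form of the bound
    print('n = %5d: sigma_max = %.6f = %.4f * mu' % (n, sigma_max, sigma_max*n))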